Scalable Web Data Extraction for Online Market Intelligence

Authors

  • Robert Baumgartner
  • Georg Gottlob
  • Marcus Herzog
Abstract

Online market intelligence (OMI), in particular competitive intelligence for product pricing, is a very important application area for Web data extraction. However, OMI presents non-trivial challenges to data extraction technology. Sophisticated and highly parameterized navigation and extraction tasks are required. On-the-fly data cleansing is necessary in order to identify identical products from different suppliers. It must be possible to smoothly define data flow scenarios that merge and filter streams of extracted data stemming from several Web sites and store the resulting data in a data warehouse, where the data is subjected to market intelligence analytics. Finally, the system must be highly scalable, in order to be able to extract and process massive amounts of data in a short time. Lixto (www.lixto.com), a company offering data extraction tools and services, has been providing OMI solutions for several customers. In this paper we show how Lixto has tackled each of the above challenges by improving and extending its original data extraction software. Most importantly, we show how high scalability is achieved through cloud computing. This paper also features a case study from the computers and electronics market.


Similar resources

Web Data Extraction for Business Intelligence: The Lixto Approach

Knowledge about market developments and competitor activities on the market becomes more and more a critical success factor for enterprises. The World Wide Web provides public domain information which can be retrieved for example from Web sites or online shops. The extraction from semi-structured information sources is mostly done manually and is therefore very time consuming. This paper descri...


Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...


The Lixto Systems Applications in Business Intelligence and Semantic Web

This paper shows how technologies for Web data extraction, syndication and integration allow for new applications and services in the Business Intelligence and the Semantic Web domain. First, we demonstrate how knowledge about market developments and competitor activities on the market can be extracted dynamically and automatically from semi-structured information sources on the Web. Then, we s...


Improving the Efficiency of Online Advertisement Targeting via Artificial Intelligence Analysis of User’s Web Surf History

The demand for and popularity of online advertising have never been higher. As a result, the industry has experienced an enormous influx of capital resulting in a highly competitive environment. The difference between success and failure in this market often depends on an ad publisher’s ability to successfully deliver advertisements that match the interests of their users. We propose and test a new...


Business and Market Intelligence 2.0

to the “skills, technologies, applications, and practices used to support decision making” (http://en.wikipedia.org/wiki/Business_intelligence). On the basis of a survey of 1,400 CEOs, the Gartner Group projected BI revenue to reach $3 billion in 2009. Through BI initiatives, businesses are gaining insights from the growing volumes of transaction, product, inventory, customer, competitor, and...



Journal:
  • PVLDB

Volume 2

Publication date: 2009